Below we’ve transcribed the assignment instructions with corresponding code blocks beneath each.
🚨Please note:
With the exception of headers and the note above, writing in the normal font is ours, and bold/underlined writing is copy-pasted from the instructions.
Code blocks were directly excerpted, adapted, and reused from previously provided instructions and guidance. We do not believe this to be plagiarism and hope that declaring it up front demonstrates our honest intent. The sources used are found in the Assignment 2 instructions (transcribed fully in the Assignment 2 folder of our repo).
Task 5 (annotating tables) has been performed inline, in a more traditional notebook manner. Rather than reserving annotations for step 5, we hope to meet the criteria by elaborating on each provided image and table adjacent to it.
Please place your data files in the data directory such that the single folder from refine.bio containing the metadata TSV and main TSV file sits in the immediate subdirectory. The data can be downloaded from Google Drive, here!
This next step loads the data and annotation libraries per the instructions, but does not attempt to transform our Gene column to Entrez IDs, as this was unnecessary for downstream processing.
# Check that files exist and are usable.
file.exists(data_file)
## [1] TRUE
file.exists(metadata_file)
## [1] TRUE
Below we read in the data, and we also take the additional measure of trimming out ambiguously labeled samples, to give clusters a better chance of lining up with our diseased/non-diseased groupings in the final statistics step.
# Visualize the ideal K for k-means clustering using the elbow method
# on the within-cluster sum of squares (WSS) metric.
if (!require("purrr"))
  install.packages("purrr")
if (!require("factoextra"))
  install.packages("factoextra")

set.seed(123)

# Expect a delay; this takes a while.
fviz_nbclust(top_5000, kmeans, k.max = 6, method = "wss")
# Using the elbow method, the optimal number of clusters appears to be ~3.
The elbow plot above reruns k-means with different numbers of clusters and visualizes how the "Total within sum of square" decreases. This within-cluster sum of squares (WSS) metric measures the proximity of points to their centroid; lower values mean tighter groupings. With the elbow method, we look for the bend in the curve, picking a number of clusters K that does not overfit the data but still provides a substantial reduction in WSS compared to K-1 clusters. Here we choose 3 as optimal.
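For intuition, the curve that fviz_nbclust draws can be reproduced by hand from the tot.withinss field of each kmeans fit. A minimal sketch on synthetic stand-in data (top_5000 could be substituted directly):

```r
# Recompute the elbow curve manually: total within-cluster sum of
# squares (WSS) for k = 1..6. Synthetic data stands in for top_5000.
set.seed(123)
mat <- matrix(rnorm(200 * 20), nrow = 200)

wss <- sapply(1:6, function(k) {
  kmeans(mat, centers = k, nstart = 25)$tot.withinss
})

# WSS shrinks as k grows; the "elbow" is where the marginal drop
# flattens out.
plot(1:6, wss, type = "b",
     xlab = "Number of clusters K", ylab = "Total within-cluster SS")
```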
# About 20 seconds.
kmeans_result_5000_three <- kmeans(top_5000, 3, iter.max = 10, nstart = 25)
kmeans_result_5000_four <- kmeans(top_5000, 4, iter.max = 10, nstart = 25)
kmeans_result_5000_five <- kmeans(top_5000, 5, iter.max = 10, nstart = 25)

# About 30 seconds.
fviz_cluster(kmeans_result_5000_three, data = top_5000, geom = "point")
Above we can see how these 3 clusters look when displayed in a 2D graph via PCA. This is a healthy sanity check indicating that our choice of three clusters (per the elbow method) is indeed acceptable. We do note that the graph appears to show overlap between these three clusters, which is expected when high-dimensional data is projected onto two principal components.
If you ran a method that requires you to select the number of clusters (k), how did changing this value change the results? Compare cluster membership at each k to investigate this.
The k-means clustering method required me to select the number of clusters. To determine the ideal k, I used the factoextra package to generate a graph of the number of clusters versus the total within-cluster sum of squares. Using the elbow method, I determined that the optimal number of clusters was 3, and to test how changing this value changed the results, I ran the kmeans clustering command for k of 3, 4, and 5.
Cluster membership at k = 3, 4, and 5 was:
k = 3: 843, 493, 265
k = 4: 591, 491, 308, 211
k = 5: 491, 370, 299, 243, 198
At k = 3, the reduction in sum of squares was 4.9%; at k = 4 it rose only slightly to 5.5%, and at k = 5 only slightly further to 5.9%. These results confirm the fviz_nbclust output, showing diminishing returns in the reduction of sum of squares as the number of clusters increases. Looking at cluster membership at k = 3, 4, and 5, we can see that membership does not change drastically between 3 and 4: it is the largest cluster that splits to form the fourth cluster. Between k = 4 and 5, however, the new cluster takes members from several clusters, not just the largest.
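The membership comparison above can be made precise with a contingency table of two labelings. A self-contained sketch on synthetic data (substitute the kmeans_result_5000_* fits computed earlier):

```r
# Cross-tabulate cluster assignments at k = 3 versus k = 4. Rows are the
# k = 3 clusters, columns the k = 4 clusters; a row whose counts are
# spread across two columns is a cluster that split when k increased.
set.seed(123)
mat <- matrix(rnorm(300 * 10), nrow = 300)
km3 <- kmeans(mat, centers = 3, nstart = 25)
km4 <- kmeans(mat, centers = 4, nstart = 25)

table(K3 = km3$cluster, K4 = km4$cluster)
```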
# Rerun the k-means clustering method with different numbers of genes AND
# K values, for the alluvial plots further down.
kmeans_result_10_three <- kmeans(top_10, 3, iter.max = 10, nstart = 25)
kmeans_result_10_four <- kmeans(top_10, 4, iter.max = 10, nstart = 25)
kmeans_result_10_five <- kmeans(top_10, 5, iter.max = 10, nstart = 25)

kmeans_result_100_three <- kmeans(top_100, 3, iter.max = 10, nstart = 25)
kmeans_result_100_four <- kmeans(top_100, 4, iter.max = 10, nstart = 25)
kmeans_result_100_five <- kmeans(top_100, 5, iter.max = 10, nstart = 25)

kmeans_result_1000_three <- kmeans(top_1000, 3, iter.max = 10, nstart = 25)
kmeans_result_1000_four <- kmeans(top_1000, 4, iter.max = 10, nstart = 25)
kmeans_result_1000_five <- kmeans(top_1000, 5, iter.max = 10, nstart = 25)

kmeans_result_10000_three <- kmeans(top_10000, 3, iter.max = 10, nstart = 25)
kmeans_result_10000_four <- kmeans(top_10000, 4, iter.max = 10, nstart = 25)
kmeans_result_10000_five <- kmeans(top_10000, 5, iter.max = 10, nstart = 25)

# print(kmeans_result_10000_three)
How did the number of genes affect clustering?
Cluster Membership at 10, 100, 1000, and 10,000 (with a k of 3) was:
10 - 612, 520, 469
100 - 760, 474, 367
1000 - 797, 500, 304
10,000 - 852, 492, 257
Looking solely at cluster membership, it appears that each time the number of genes increased, membership in the largest cluster grew steadily while membership in the other clusters fluctuated. As the number of genes increased, the reduction in sum of squares at k = 3 was 34.8%, 11.9%, 7.5%, and 4.2% at 10, 100, 1,000, and 10,000 genes, respectively; the reduction in sum of squares therefore falls off sharply as genes are added.
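Assuming the "reduction in sum of squares" quoted here is the between_SS / total_SS figure that kmeans prints, it can be pulled out of each fit programmatically rather than read off the console. A sketch on a synthetic fit (substitute the kmeans_result_*_three objects):

```r
# between_SS / total_SS: the share of total variance explained by the
# clustering, which print(kmeans(...)) reports as
# "(between_SS / total_SS = ...%)".
ss_reduction <- function(km) round(100 * km$betweenss / km$totss, 1)

set.seed(123)
km <- kmeans(matrix(rnorm(300 * 10), nrow = 300), centers = 3, nstart = 25)
ss_reduction(km)
```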
alluvialData <- data.frame(
  K3 = kmeans_result_10_three$cluster,
  K4 = kmeans_result_10_four$cluster,
  K5 = kmeans_result_10_five$cluster
)

ggplot(data = alluvialData, aes(axis1 = K3, axis2 = K4, axis3 = K5)) +
  geom_alluvium(aes(fill = K3)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "KMeans for 10 Genes at K=3,4,5")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.
The first of the alluvial plots here shows only 10 genes' worth of data. We can see how the samples in cluster 1 bifurcate early, while the samples of cluster 2 in the K=4 column bifurcate into groups 3 and 5 at K=5. Nothing overwhelmingly insightful here, but let's see what happens with more genes.
alluvialData <- data.frame(
  K3 = kmeans_result_100_three$cluster,
  K4 = kmeans_result_100_four$cluster,
  K5 = kmeans_result_100_five$cluster
)

ggplot(data = alluvialData, aes(axis1 = K3, axis2 = K4, axis3 = K5)) +
  geom_alluvium(aes(fill = K3)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "KMeans for 100 Genes at K=3,4,5")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.
Things get more complicated with more genes. The pattern of bifurcations appears significantly different, although of some intrigue is the relatively robust group 2 chosen at K=3 (first column), which seems to barely bifurcate at all (only a small fraction goes to group 2 at K=5, last column). Perhaps this group is the tightest, whereas the grey bands of group 1 from K=3 (again, first column) go many places, indicating it was a poor discriminant and likely of lesser significance.
alluvialData <- data.frame(
  K3 = kmeans_result_1000_three$cluster,
  K4 = kmeans_result_1000_four$cluster,
  K5 = kmeans_result_1000_five$cluster
)

ggplot(data = alluvialData, aes(axis1 = K3, axis2 = K4, axis3 = K5)) +
  geom_alluvium(aes(fill = K3)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "KMeans for 1000 Genes at K=3,4,5")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.
Again the pattern changes dramatically. This is unsurprising given that we are adding more highly variable genes to the mix; little more can be said.
alluvialData <- data.frame(
  K3 = kmeans_result_10000_three$cluster,
  K4 = kmeans_result_10000_four$cluster,
  K5 = kmeans_result_10000_five$cluster
)

ggplot(data = alluvialData, aes(axis1 = K3, axis2 = K4, axis3 = K5)) +
  geom_alluvium(aes(fill = K3)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "KMeans for 10,000 Genes at K=3,4,5")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.
This last diagram is perhaps the most significant for examining groups, since it is based on the most data. Cluster 2 (K=3) is exceptional here and bifurcates minimally, indicating a high "goodness of clustering".
Note: ~1 minute runtime on a MacBook Air M2, so expect a delay.
# Hierarchical clustering on the 5000 most variable genes.
distance <- dist(top_5000, method = "euclidean")
hcluster <- hclust(distance, method = "complete")

# 2 clusters
hcluster_result_5000_two <- cutree(hcluster, k = 2)
# 3 clusters
hcluster_result_5000_three <- cutree(hcluster, k = 3)
# 4 clusters
hcluster_result_5000_four <- cutree(hcluster, k = 4)

# Visualize the dendrogram.
plot(hcluster, cex = 0.8, hang = -1)
# rect.hclust(hcluster, k = 2, border = 2:5)
# If you ran a method that requires you to select the number of clusters (k),
# how did changing this value change the results? Compare cluster membership
# at each k to investigate this.
# Visualize clusters
fviz_cluster(list(data = top_5000, cluster = hcluster_result_5000_two), geom = "point")
fviz_cluster(list(data = top_5000, cluster = hcluster_result_5000_three), geom = "point")
fviz_cluster(list(data = top_5000, cluster = hcluster_result_5000_four), geom = "point")
This set of graphs is quite interesting. The two-cluster graph amusingly displays what appears to be a single cluster, likely another folly of the 2D projection. The three-cluster graph shows how the larger visible cluster has bifurcated (this time quite literally, given the nature of hierarchical clustering). The four-cluster graph further divides these two larger clusters, though it is unclear from which the points of cluster 1 originate; the extremely tight cluster (now cluster 4) remains centered and nearly invisible.
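The question of where cluster 1's points originate can actually be settled mechanically: because cutree labelings from one tree are nested, a cross-tabulation shows exactly which k = 3 cluster each k = 4 cluster was carved out of. A self-contained sketch on a synthetic tree (substitute hcluster_result_5000_three and hcluster_result_5000_four):

```r
# Build a small tree and compare cuts at k = 3 and k = 4. Each k = 4
# cluster draws all its members from a single k = 3 cluster, so every
# column of the table has exactly one nonzero entry.
set.seed(123)
hc <- hclust(dist(matrix(rnorm(100 * 5), nrow = 100)), method = "complete")
cut3 <- cutree(hc, k = 3)
cut4 <- cutree(hc, k = 4)

table(K3 = cut3, K4 = cut4)
```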
Expect a further delay (~1 minute).
# HClust on the top 10 genes fails here, presumably because 10 genes do
# not provide enough data. The origin of the issue is unclear, but the
# resulting dendrogram is deemed invalid.
# distance <- dist(top_10, method = "euclidean")
# hclusterTop10 <- hclust(distance, method = "complete")
# plot(hclusterTop10, cex = 0.8, hang = -1, labels = TRUE)
# hcluster_result_10_two <- cutree(hclusterTop10, k = 2)

# 100 most variable genes, hierarchical clustering.
distance <- dist(top_100, method = "euclidean")
hclusterTop100 <- hclust(distance, method = "complete")
plot(hclusterTop100, cex = 0.8, hang = -1, labels = FALSE)
hcluster_result_100_two <- cutree(hclusterTop100, k = 2)

# 1000 most variable genes, hierarchical clustering.
distance <- dist(top_1000, method = "euclidean")
hclusterTop1000 <- hclust(distance, method = "complete")
plot(hclusterTop1000, cex = 0.8, hang = -1, labels = FALSE)
hcluster_result_1000_two <- cutree(hclusterTop1000, k = 2)

# 10000 most variable genes, hierarchical clustering.
distance <- dist(top_10000, method = "euclidean")
hclusterTop10000 <- hclust(distance, method = "complete")
plot(hclusterTop10000, cex = 0.8, hang = -1, labels = FALSE)
hcluster_result_10000_two <- cutree(hclusterTop10000, k = 2)
The dendrograms above illustrate the trees produced by the variations of hierarchical clustering. Of interest is the very large branch holding the majority of samples, segregated away from the smaller groups in the 10,000-gene dendrogram. This will have consequences for the alluvial visuals below, and raises the question of whether the band widths alluvial diagrams produce should be scaled logarithmically rather than linearly.
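The size imbalance the 10,000-gene dendrogram hints at can be quantified by tabulating the cut memberships. A sketch that reuses the hcluster_result_*_two vectors from the chunk above:

```r
# Cluster sizes at k = 2 for each gene-count variant. A heavily dominant
# first cluster in the 10,000-gene cut reflects the large undivided
# branch visible in its dendrogram.
lapply(list(N100   = hcluster_result_100_two,
            N1000  = hcluster_result_1000_two,
            N10000 = hcluster_result_10000_two),
       table)
```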
Because the nature of hierarchical clustering makes the alluvial diagrams exceptionally uninteresting, we've opted to highlight only the alluvials at 100 genes and 10,000 genes, but with more divisions (you're more or less looking at a horizontal dendrogram…)
hcluster_result_100_two <- cutree(hclusterTop100, k = 2)
hcluster_result_100_three <- cutree(hclusterTop100, k = 3)
hcluster_result_100_four <- cutree(hclusterTop100, k = 4)
hcluster_result_100_five <- cutree(hclusterTop100, k = 5)

hcluster_result_10000_two <- cutree(hclusterTop10000, k = 2)
hcluster_result_10000_three <- cutree(hclusterTop10000, k = 3)
hcluster_result_10000_four <- cutree(hclusterTop10000, k = 4)
hcluster_result_10000_five <- cutree(hclusterTop10000, k = 5)

alluvial_data <- data.frame(
  K2 = hcluster_result_100_two,
  K3 = hcluster_result_100_three,
  K4 = hcluster_result_100_four,
  K5 = hcluster_result_100_five
)

ggplot(data = alluvial_data, aes(axis1 = K2, axis2 = K3, axis3 = K4, axis4 = K5)) +
  geom_alluvium(aes(fill = K2)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "Hierarchical Clustering across different values of K at 100 genes")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.
alluvial_data <- data.frame(
  K2 = hcluster_result_10000_two,
  K3 = hcluster_result_10000_three,
  K4 = hcluster_result_10000_four,
  K5 = hcluster_result_10000_five
)

ggplot(data = alluvial_data, aes(axis1 = K2, axis2 = K3, axis3 = K4, axis4 = K5)) +
  geom_alluvium(aes(fill = K2)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  labs(title = "Hierarchical Clustering across different values of K at 10000 genes")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.
There are two diagrams here. The first shows almost nothing of interest because, by the nature of hierarchical clustering, groups are perfectly divided and samples cannot cross over to other branches partway down the tree. Thus from K=2 to K=3, cluster 2 bifurcates into clusters 2 and 3; from K=3 to K=4, cluster 1 bifurcates into clusters 1 and 2; and so on. Very simple. As for the second alluvial…
You're probably thinking "that doesn't look right". Au contraire. If we look at the dendrogram for 10,000 genes, we see that the ensuing splits in the hierarchy take place on the right side, with one giant arm reaching to the left. This is "cluster 2", which happens to be renamed cluster 4 in the final column, but represents the large undivided branch of the tree that is only subdivided later in the hierarchy.
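The renaming is no accident: cutree numbers groups by the order in which they first appear among the observations, so the same branch can carry different labels at different cuts. Matching branches across cuts therefore requires a cross-tabulation rather than trusting the label values, as this self-contained sketch shows:

```r
# The same branch of a tree can be labeled differently at different k,
# because cutree numbers clusters by order of first appearance in the
# data. The cross-tab still shows the nesting: every k = 5 cluster sits
# inside exactly one k = 2 cluster, whatever its label.
set.seed(1)
hc <- hclust(dist(matrix(rnorm(60 * 4), nrow = 60)), method = "complete")

table(K2 = cutree(hc, k = 2), K5 = cutree(hc, k = 5))
```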
library(cluster)

# Center the data around the median gene expression in each row, without
# touching the original data. This came from the ConsensusClusterPlus
# guide; scale(), which the internet recommends for use with PAM, would
# accomplish the same thing with means instead of medians.
top_10m <- sweep(top_10, 1, apply(top_10, 1, median, na.rm = TRUE))
top_100m <- sweep(top_100, 1, apply(top_100, 1, median, na.rm = TRUE))
top_1000m <- sweep(top_1000, 1, apply(top_1000, 1, median, na.rm = TRUE))
top_5000m <- sweep(top_5000, 1, apply(top_5000, 1, median, na.rm = TRUE))
top_10000m <- sweep(top_10000, 1, apply(top_10000, 1, median, na.rm = TRUE))

# Takes quite a while at 10,000.
k <- 3
pam_result_10_three <- pam(top_10m, k = k)
pam_result_100_three <- pam(top_100m, k = k)
pam_result_1000_three <- pam(top_1000m, k = k)
pam_result_5000_three <- pam(top_5000m, k = k)
pam_result_10000_three <- pam(top_10000m, k = k)
🚨The instructions linked for PAM suggested the use of a precomputed dissimilarity matrix, but I couldn't get fviz and other downstream operations to work with it, so I'm running PAM the classic way (it generates the dissimilarity matrix itself). I suspect this is why the runtime is so crippling. Please consider enjoying some popcorn, or perhaps painting the nearest surface, before hitting run on this next cell, so that you may enjoy some semblance of stimulation. Your RAM usage graph may also be amusing to watch.
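For the record, the dissimilarity-matrix route looks like the sketch below. Precomputing dist() once means pam() skips rebuilding distances for every k, which is where much of the runtime goes; fviz_cluster() cannot plot a diss-only fit by itself, but it does accept the clustering plus the original coordinates as a list. Synthetic data stands in for top_5000m here:

```r
library(cluster)

# Precompute the dissimilarity matrix once and reuse it across k.
set.seed(123)
x <- matrix(rnorm(150 * 8), nrow = 150)   # stand-in for top_5000m
d <- dist(x, method = "euclidean")

pam_fits <- lapply(3:5, function(k) pam(d, k = k, diss = TRUE))

# Plotting needs the coordinates back, since the fit only saw distances:
factoextra::fviz_cluster(list(data = x, cluster = pam_fits[[1]]$clustering),
                         geom = "point")
```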
#Let's compare K values:
cluster_counts <- table(pam_result_5000_three$clustering)
cluster_counts
##
## 1 2 3
## 469 291 293
pam_result_5000_four <- pam(top_5000m, k = 4)
cluster_counts <- table(pam_result_5000_four$clustering)
cluster_counts
##
## 1 2 3 4
## 388 285 97 283
pam_result_5000_five <- pam(top_5000m, k = 5)
cluster_counts <- table(pam_result_5000_five$clustering)
cluster_counts
##
## 1 2 3 4 5
## 338 101 236 95 283
pam_result_5000_six <- pam(top_5000m, k = 6)
cluster_counts <- table(pam_result_5000_six$clustering)
cluster_counts
##
## 1 2 3 4 5 6
## 335 101 236 92 80 209
fviz_cluster(pam_result_5000_three, geom = "point")
fviz_cluster(pam_result_5000_four, geom = "point")
fviz_cluster(pam_result_5000_five, geom = "point")
fviz_cluster(pam_result_5000_six, geom = "point")
We can see how membership changes above, at least in the two dimensions chosen for the visualization (since we're operating on 5000-dimensional data). It does seem that more clusters reduce the overlap, but the extent to which they do is better demonstrated through the silhouette method shown below because, amusingly, some of the new clusters introduced in the 2D plots above show literally zero reduction in overlap: four of the generated clusters overlap almost perfectly with each other, implying the PCA projection has failed to adequately visualize these groups.
🚨Note: takes ~4 minutes on a MacBook Air M2.
# Takes around 4 minutes on a MacBook Air M2.
library(factoextra)
fviz_nbclust(top_5000m, pam, k.max = 6, verbose = TRUE,
             print.summary = TRUE, method = "silhouette")
# Should identify 2 as the optimal K value, as a function of silhouette score.
The silhouette method used above takes all the points resulting from the different K values of PAM, gives each a score (based on its proximity to its assigned cluster and its distance to other clusters), and then averages the scores across all points to produce a total that represents discrimination. We'd expect this to improve with more clusters, but alas it seems 2 clusters is locally optimal at effectively segregating our samples. This is pretty odd, but the numbers don't seem to lie, and since the ensuing silhouette scores aren't terribly worse, the chart below is shown at the optimal k = 2.
Below I visualize the silhouette score. We can see that the samples in the two clusters that overlap are relatively few (the parts of the red and blue chunks below 0). The dotted red line represents the average score, which here is quite low (about 0.04 to 0.05), indicating weak separation. The x axis is individual samples, though I've removed the labels to keep things clean!
# Use a distinct name here: reusing pam_result_5000_five would overwrite
# the k = 5 fit needed for the alluvial and statistics further down.
pam_result_5000_two <- pam(top_5000m, k = 2)
fviz_silhouette(pam_result_5000_two, label = FALSE)
## cluster size ave.sil.width
## 1 1 760 0.04
## 2 2 293 0.05
This silhouette chart is hilarious. The loss in silhouette score comes almost entirely from the red cluster 1, which refuses to be separated by PAM, while cluster 2 fares only marginally better. Perhaps if more clusters were added, a global (or at least a new local) optimum at a K of, say, 15 or 20 would be found, but for now the monolithic cluster 1 is dragging the score down.
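The averages in that table can also be reproduced from scratch with cluster::silhouette, which is the same computation fviz_silhouette reports. A self-contained sketch on synthetic two-cluster data (substitute top_5000m and the PAM fit):

```r
library(cluster)

# Average silhouette width computed directly:
# s(i) = (b(i) - a(i)) / max(a(i), b(i)) per point, then averaged.
set.seed(123)
x <- rbind(matrix(rnorm(100 * 4), ncol = 4),
           matrix(rnorm(100 * 4, mean = 3), ncol = 4))
fit <- pam(x, k = 2)

sil <- silhouette(fit$clustering, dist(x))
mean(sil[, "sil_width"])
```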
Finally, let's look at an alluvial. Since PAM at 10,000 genes is very computationally expensive, I will instead show an alluvial at 5,000 genes across many K values, as we already computed these above.
top_5000G <- top_5000 %>%
  tibble::rownames_to_column("Gene")

alluvialData <- data.frame(
  K3 = pam_result_5000_three$clustering,
  K4 = pam_result_5000_four$clustering,
  K5 = pam_result_5000_five$clustering,
  K6 = pam_result_5000_six$clustering
)

ggplot(
  data = alluvialData,
  aes(axis1 = K3, axis2 = K4, axis3 = K5, axis4 = K6)
) +
  geom_alluvium(aes(fill = K6)) +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_classic() +
  xlab("Number of Clusters") +
  ylab("Sample Count") +
  labs(title = "PAM for 5000 Genes at K=3,4,5,6")
## Warning in to_lodes_form(data = data, axes = axis_ind, discern =
## params$discern): Some strata appear at multiple axes.
Nasty. We can see the monolithic cluster squeezing the other groups away, making me ponder whether a logarithmically scaled cluster-size visualization would be preferable. Alas. This speaks both to the significant goodness-of-cluster this monolith exhibits and to the robust cluster 2, which also undergoes minimal subdivision at higher K values.
A lesson here is that the marginalized groups in an alluvial are often the intriguing ones, making it a poor visualization for interesting outliers (though it captures macroscopic trends that are still surely significant and more likely to be relevant to the disease or affliction we're usually trying to capture in bioinformatics).
Heatmaps and Dendrograms
# Gene heatmap on the 1,000 most variable genes.
mat_1000 <- data.matrix(top_1000, rownames.force = FALSE)
ht <- ComplexHeatmap::Heatmap(
  mat_1000,
  heatmap_legend_param = list(title = "Expression", at = c(0, 50, 100, 150)),
  column_title = "Clustered Genes",
  show_column_names = FALSE,
  cluster_columns = TRUE,
  cluster_rows = TRUE
)
ComplexHeatmap::draw(ht)
#
# annotation_data <- data.frame(
# kmeans100four = as.factor(kmeans_result_100_four$cluster),
# kmeans1000five = as.factor(kmeans_result_1000_five$cluster),
# ReferenceGroup = as.factor(reference_group)
# )
#
# pheatmap(
# expression_df,
# annotation_col = annotation_data,
# show_rownames = FALSE,
# show_colnames = FALSE
# )
To generate the above heatmap, we again used the ComplexHeatmap library, here on the 1,000 most variable genes. We provided a legend based on gene expression, and to cluster the genes within the heatmap we set the cluster_columns and cluster_rows parameters to TRUE; this also displays the dendrograms for the rows and columns. Unfortunately, we were unable to organize the heatmap according to the clusters we determined in earlier sections of the assignment, and so could not annotate sections for those columns. While clusters clearly exist in the data, the heatmap does not clearly reflect this fact.
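For reference, attaching the earlier cluster labels is possible through ComplexHeatmap's HeatmapAnnotation. This is an untested sketch, under the assumption that the heatmap columns are the samples in the same order as the k-means cluster vector (our matrix has samples as rows, hence the t()):

```r
library(ComplexHeatmap)

# Sketch: column annotation carrying the k-means labels from earlier.
# Assumes the heatmap columns are samples ordered identically to the
# cluster vector; transpose first if samples are the rows, as here.
col_anno <- HeatmapAnnotation(
  kmeans = factor(kmeans_result_5000_three$cluster)
)

ht_annotated <- Heatmap(
  t(data.matrix(top_5000)),
  name = "Expression",
  top_annotation = col_anno,
  show_column_names = FALSE
)
ComplexHeatmap::draw(ht_annotated)
```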
Statistics
# Alright, let's prep the metadata to better select what we need.
clusterResults <- list(
  "KMEANS_K5_N10" = kmeans_result_10_five$cluster,
  "KMEANS_K4_N10" = kmeans_result_10_four$cluster,
  "KMEANS_K3_N10" = kmeans_result_10_three$cluster,
  "KMEANS_K5_N100" = kmeans_result_100_five$cluster,
  "KMEANS_K4_N100" = kmeans_result_100_four$cluster,
  "KMEANS_K3_N100" = kmeans_result_100_three$cluster,
  "KMEANS_K5_N1000" = kmeans_result_1000_five$cluster,
  "KMEANS_K4_N1000" = kmeans_result_1000_four$cluster,
  "KMEANS_K3_N1000" = kmeans_result_1000_three$cluster,
  "KMEANS_K5_N10000" = kmeans_result_10000_five$cluster,
  "KMEANS_K4_N10000" = kmeans_result_10000_four$cluster,
  "KMEANS_K3_N10000" = kmeans_result_10000_three$cluster,
  "KMEANS_K5_N5000" = kmeans_result_5000_five$cluster,
  "KMEANS_K4_N5000" = kmeans_result_5000_four$cluster,
  "KMEANS_K3_N5000" = kmeans_result_5000_three$cluster,
  "HCLUSTER_K5_N100" = hcluster_result_100_five,
  "HCLUSTER_K4_N100" = hcluster_result_100_four,
  "HCLUSTER_K3_N100" = hcluster_result_100_three,
  "HCLUSTER_K2_N100" = hcluster_result_100_two,
  "HCLUSTER_K5_N10000" = hcluster_result_10000_five,
  "HCLUSTER_K4_N10000" = hcluster_result_10000_four,
  "HCLUSTER_K3_N10000" = hcluster_result_10000_three,
  "HCLUSTER_K2_N10000" = hcluster_result_10000_two,
  "HCLUSTER_K4_N5000" = hcluster_result_5000_four,
  "HCLUSTER_K3_N5000" = hcluster_result_5000_three,
  "HCLUSTER_K2_N5000" = hcluster_result_5000_two,
  "PAM_K3_N10" = pam_result_10_three$clustering,
  "PAM_K3_N100" = pam_result_100_three$clustering,
  "PAM_K3_N1000" = pam_result_1000_three$clustering,
  "PAM_K3_N10000" = pam_result_10000_three$clustering,
  "PAM_K3_N5000" = pam_result_5000_three$clustering,
  "PAM_K4_N5000" = pam_result_5000_four$clustering,
  "PAM_K5_N5000" = pam_result_5000_five$clustering,
  "PAM_K6_N5000" = pam_result_5000_six$clustering
)
Now we need to trim our "culledMeta" to only the columns we care about: the accession code, and diabetic vs. non-diabetic status.
referenceGroup <- subset(culledMeta, select = c("refinebio_accession_code", "diabetes"))

referenceGroup <- referenceGroup %>%
  dplyr::mutate(diabetes = dplyr::case_when(
    stringr::str_detect(diabetes, "reference") ~ 0,
    stringr::str_detect(diabetes, "diabetic") ~ 1
  ))

reference_group <- referenceGroup$diabetes

results_df <- data.frame(
  Name = character(),
  X_Squared = numeric(),
  Degrees_of_Freedom = integer(),
  P_Value_Adjusted = numeric(),
  stringsAsFactors = FALSE
)
# Loop through each clustering result. Note: chisq.test() on the raw
# label vector runs a goodness-of-fit test treating the labels themselves
# as counts (hence df = 1052 and the "approximation may be incorrect"
# warnings); the contingency-table test against the reference group
# follows further down.
for (name in names(clusterResults)) {
  test_result <- chisq.test(clusterResults[[name]])
  x_squared <- ifelse(is.null(test_result$statistic), 0, test_result$statistic)
  df <- ifelse(is.null(test_result$parameter), 0, test_result$parameter)
  p_value <- ifelse(is.null(test_result$p.value), -1, p.adjust(test_result$p.value))

  temp_df <- data.frame(
    Name = name,
    X_Squared = x_squared,
    Degrees_of_Freedom = df,
    P_Value_Adjusted = p_value
  )
  results_df <- rbind(results_df, temp_df)
}
## Warning in chisq.test(clusterResults[[name]]): Chi-squared approximation may be
## incorrect
## (the same warning is emitted for each of the 34 tests)
print(results_df)
## Name X_Squared Degrees_of_Freedom P_Value_Adjusted
## 1 KMEANS_K5_N10 765.8131 1052 1.0000000000
## 2 KMEANS_K4_N10 428.9811 1052 1.0000000000
## 3 KMEANS_K3_N10 379.2207 1052 1.0000000000
## 4 KMEANS_K5_N100 846.5928 1052 0.9999991795
## 5 KMEANS_K4_N100 491.4929 1052 1.0000000000
## 6 KMEANS_K3_N100 396.9655 1052 1.0000000000
## 7 KMEANS_K5_N1000 668.6595 1052 1.0000000000
## 8 KMEANS_K4_N1000 448.9761 1052 1.0000000000
## 9 KMEANS_K3_N1000 243.6913 1052 1.0000000000
## 10 KMEANS_K5_N10000 494.6049 1052 1.0000000000
## 11 KMEANS_K4_N10000 484.4074 1052 1.0000000000
## 12 KMEANS_K3_N10000 344.3206 1052 1.0000000000
## 13 KMEANS_K5_N5000 706.0422 1052 1.0000000000
## 14 KMEANS_K4_N5000 573.0656 1052 1.0000000000
## 15 KMEANS_K3_N5000 254.3345 1052 1.0000000000
## 16 HCLUSTER_K5_N100 1123.0592 1052 0.0630019421
## 17 HCLUSTER_K4_N100 731.8579 1052 1.0000000000
## 18 HCLUSTER_K3_N100 461.2623 1052 1.0000000000
## 19 HCLUSTER_K2_N100 180.6631 1052 1.0000000000
## 20 HCLUSTER_K5_N10000 737.5934 1052 1.0000000000
## 21 HCLUSTER_K4_N10000 163.7260 1052 1.0000000000
## 22 HCLUSTER_K3_N10000 157.7638 1052 1.0000000000
## 23 HCLUSTER_K2_N10000 154.7510 1052 1.0000000000
## 24 HCLUSTER_K4_N5000 502.7748 1052 1.0000000000
## 25 HCLUSTER_K3_N5000 180.3102 1052 1.0000000000
## 26 HCLUSTER_K2_N5000 178.6978 1052 1.0000000000
## 27 PAM_K3_N10 368.1386 1052 1.0000000000
## 28 PAM_K3_N100 383.8800 1052 1.0000000000
## 29 PAM_K3_N1000 346.1091 1052 1.0000000000
## 30 PAM_K3_N10000 301.7892 1052 1.0000000000
## 31 PAM_K3_N5000 399.6943 1052 1.0000000000
## 32 PAM_K4_N5000 683.3583 1052 1.0000000000
## 33 PAM_K5_N5000 165.4383 1052 1.0000000000
## 34 PAM_K6_N5000 1199.9504 1052 0.0009677833
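The repeated "Chi-squared approximation may be incorrect" warnings above arise when some cells of the contingency table have small expected counts (conventionally below 5), which makes the asymptotic chi-squared distribution unreliable. One common remedy, not applied in our run, is to compute a Monte Carlo p-value with `simulate.p.value = TRUE`. A minimal sketch on a toy 2×2 table (the counts are illustrative, not from our data):

```r
# Toy contingency table with a small expected count (10 * 9 / 20 = 4.5 < 5),
# which triggers the same "approximation may be incorrect" warning.
set.seed(123)
tbl <- matrix(c(8, 1, 2, 9), nrow = 2)

# Asymptotic test: emits the warning because of the small expected counts.
asymptotic <- suppressWarnings(chisq.test(tbl))

# Monte Carlo test: estimates the p-value from B simulated tables instead of
# relying on the chi-squared approximation, so no warning is needed.
simulated <- chisq.test(tbl, simulate.p.value = TRUE, B = 10000)

asymptotic$p.value
simulated$p.value
```

Both p-values remain valid probabilities, but the simulated one does not depend on the large-sample approximation that the warning is flagging.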
# For each clustering result, cross-tabulate cluster assignment against the
# diseased/not-diseased reference grouping and run a chi-squared test of
# independence.
for (name in names(clusterResults)) {
  tbl <- table(clusterResults[[name]], reference_group)
  test <- chisq.test(tbl)
  test$p.value <- p.adjust(test$p.value)
  cat("Result for", name, ":\n")
  print(test)
  cat("\n-------------------------------\n")
}
## Result for KMEANS_K5_N10 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 36.982, df = 4, p-value = 1.817e-07
##
##
## -------------------------------
## Result for KMEANS_K4_N10 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 7.4321, df = 3, p-value = 0.05933
##
##
## -------------------------------
## Result for KMEANS_K3_N10 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 4.963, df = 2, p-value = 0.08362
##
##
## -------------------------------
## Result for KMEANS_K5_N100 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 157.28, df = 4, p-value < 2.2e-16
##
##
## -------------------------------
## Result for KMEANS_K4_N100 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 159.33, df = 3, p-value < 2.2e-16
##
##
## -------------------------------
## Result for KMEANS_K3_N100 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 145.22, df = 2, p-value < 2.2e-16
##
##
## -------------------------------
## Result for KMEANS_K5_N1000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 157.64, df = 4, p-value < 2.2e-16
##
##
## -------------------------------
## Result for KMEANS_K4_N1000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 165.35, df = 3, p-value < 2.2e-16
##
##
## -------------------------------
## Result for KMEANS_K3_N1000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 158.39, df = 2, p-value < 2.2e-16
##
##
## -------------------------------
## Result for KMEANS_K5_N10000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 244.09, df = 4, p-value < 2.2e-16
##
##
## -------------------------------
## Result for KMEANS_K4_N10000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 214.61, df = 3, p-value < 2.2e-16
##
##
## -------------------------------
## Result for KMEANS_K3_N10000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 168.87, df = 2, p-value < 2.2e-16
##
##
## -------------------------------
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Result for KMEANS_K5_N5000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 227.76, df = 4, p-value < 2.2e-16
##
##
## -------------------------------
## Result for KMEANS_K4_N5000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 181.93, df = 3, p-value < 2.2e-16
##
##
## -------------------------------
## Result for KMEANS_K3_N5000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 182.79, df = 2, p-value < 2.2e-16
##
##
## -------------------------------
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Result for HCLUSTER_K5_N100 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 98.745, df = 4, p-value < 2.2e-16
##
##
## -------------------------------
## Result for HCLUSTER_K4_N100 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 97.41, df = 3, p-value < 2.2e-16
##
##
## -------------------------------
## Result for HCLUSTER_K3_N100 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 73.327, df = 2, p-value < 2.2e-16
##
##
## -------------------------------
## Result for HCLUSTER_K2_N100 :
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl
## X-squared = 2.5675, df = 1, p-value = 0.1091
##
##
## -------------------------------
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Result for HCLUSTER_K5_N10000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 91.018, df = 4, p-value < 2.2e-16
##
##
## -------------------------------
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Result for HCLUSTER_K4_N10000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 7.3542, df = 3, p-value = 0.06143
##
##
## -------------------------------
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Result for HCLUSTER_K3_N10000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 7.2395, df = 2, p-value = 0.02679
##
##
## -------------------------------
## Result for HCLUSTER_K2_N10000 :
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl
## X-squared = 5.149, df = 1, p-value = 0.02326
##
##
## -------------------------------
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Result for HCLUSTER_K4_N5000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 121.88, df = 3, p-value < 2.2e-16
##
##
## -------------------------------
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Result for HCLUSTER_K3_N5000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 0.14406, df = 2, p-value = 0.9305
##
##
## -------------------------------
## Result for HCLUSTER_K2_N5000 :
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl
## X-squared = 0.0049344, df = 1, p-value = 0.944
##
##
## -------------------------------
## Result for PAM_K3_N10 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 16.229, df = 2, p-value = 0.0002992
##
##
## -------------------------------
## Result for PAM_K3_N100 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 91.235, df = 2, p-value < 2.2e-16
##
##
## -------------------------------
## Result for PAM_K3_N1000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 5.1362, df = 2, p-value = 0.07668
##
##
## -------------------------------
## Result for PAM_K3_N10000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 22.944, df = 2, p-value = 1.042e-05
##
##
## -------------------------------
## Result for PAM_K3_N5000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 59.954, df = 2, p-value = 9.576e-14
##
##
## -------------------------------
## Result for PAM_K4_N5000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 91.076, df = 3, p-value < 2.2e-16
##
##
## -------------------------------
## Result for PAM_K5_N5000 :
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl
## X-squared = 34.378, df = 1, p-value = 4.538e-09
##
##
## -------------------------------
## Result for PAM_K6_N5000 :
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 90.859, df = 5, p-value < 2.2e-16
##
##
## -------------------------------
These adjusted p-values are poor: for nearly every run we fail to reject the null hypothesis, so cluster assignment and disease status appear to be independent. This may stem from our choice to select the most variably expressed genes rather than the most highly expressed ones; high-variance selection tends to pick the genes that are noisiest across our samples and thus yields relatively weak clusters (consistent with the oddly low number of clusters suggested by the silhouette method and the lack of a clear elbow in the WSS evaluation for k-means).
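One caveat on our multiple-testing correction: `p.adjust()` only corrects for multiplicity when handed the full vector of p-values at once. Calling it on a single p-value inside the loop, as we do above, is a no-op, because with one test there is nothing to adjust for. A small self-contained demonstration (the values below are made up for illustration):

```r
# Five hypothetical raw p-values from five independent tests.
p <- c(0.001, 0.01, 0.03, 0.2, 0.9)

# Adjusting one p-value in isolation leaves it unchanged: with n = 1 there is
# no multiplicity to correct for.
p.adjust(p[1])  # 0.001

# Adjusting the whole vector applies the Holm step-down correction jointly.
p.adjust(p, method = "holm")  # 0.005 0.040 0.090 0.400 0.900
```

To correct our chi-squared results properly, we would collect every test's raw p-value into one vector and adjust that vector in a single `p.adjust()` call, as was done for the `P_Value_Adjusted` column of `results_df`.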
Write a short summary for each plot/table you created in this assignment. In 3-5 sentences for each, describe what you did, what parameters you used (if any) and an interesting result from it.
✅
As a team, fill out the team evaluation table below.
✅
Combine all results into a single file, submit on Canvas. Make sure that all your code is added to your GitHub repository.
✅